SYSTEM AND METHOD FOR PROVIDING MULTI-RESOURCE MANAGEMENT
SUPPORT IN A COMPUTE ENVIRONMENT

PRIORITY CLAIM 

[0001] The present application claims priority to U.S. Provisional Application No. 60/552,653 filed
March 13, 2004, the contents of which are incorporated herein by reference.

RELATED APPLICATIONS 

[0002] The present application is related to Attorney Docket Numbers 010-0011, 010-0011A,
010-0011B, 010-0011C, 010-0019, 010-0026, 010-0028 and 010-0030 filed on the same day as the
present application. The content of each of these cases is incorporated herein by reference.

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

[0003] The present invention relates to managing resources in a compute environment and more
specifically to a system and method of querying and controlling resources in a compute
environment such as a multi-cluster environment.

2. Introduction 

[0004] Grid computing may be defined as coordinated resource sharing and problem solving in
dynamic, multi-institutional collaborations. Many computing projects require much more
computational power and resources than a single computer may provide. Networked computers
with peripheral resources such as printers, scanners, I/O devices, storage disks, scientific devices
and instruments, etc. may need to be coordinated and utilized to complete a task.
[0005] Grid/cluster resource management generally describes the process of identifying
requirements, matching resources to applications, allocating those resources, and scheduling and
monitoring grid resources over time in order to run grid applications as efficiently as possible. Each
job submitted for processing will utilize a different set of resources and thus is typically unique. In
addition to the challenge of allocating resources for a particular job, grid administrators also have
difficulty obtaining a clear understanding of the resources available, the current status of the grid
and available resources, and real-time competing needs of various users. General background
information on clusters and grids may be found in several publications. See, e.g., Grid Resource
Management: State of the Art and Future Trends, Jarek Nabrzyski, Jennifer M. Schopf, and Jan
Weglarz, Kluwer Academic Publishers, 2004; and Beowulf Cluster Computing with Linux, edited by
William Gropp, Ewing Lusk, and Thomas Sterling, Massachusetts Institute of Technology, 2003.
[0006] It is generally understood herein that the terms grid and cluster are interchangeable in that
there is no specific definition of either. The definition of a grid is very flexible and may mean a
number of different configurations of computers. The introduction here is meant to be general
given the variety of configurations that are possible. In general, a grid will comprise a plurality of
clusters as is shown in FIG. 1. Several general challenges exist when attempting to maximize
resources in a grid. First, there are typically multiple layers of grid and cluster schedulers. A grid
100 generally comprises a group of clusters 110, 112 or a group of networked computers. A grid
scheduler 102 communicates with a plurality of cluster schedulers 104A, 104B and 104C. Each of
these cluster schedulers communicates with a plurality of resource managers 106A, 106B and 106C.
Each resource manager communicates with a series of compute resources shown as nodes 108A,
108B, 108C within cluster 110 and 108D, 108E, 108F in cluster 112.

[0007] There are various vendors of resource managers 106A, 106B, 106C that require differing
means of communication to and from the resource manager. For example, one resource manager
vendor may have software that only communicates with certain types of compute resources, such as
Microsoft or Linux operating systems, hard drives using certain protocols and so forth. This can
cause challenges when a cluster includes a variety of cluster resource managers in order to
communicate with these differing products.

[0008] Other challenges of the model shown in FIG. 1 exist as well. Local schedulers (which may
refer to either the cluster schedulers 104 or the resource managers 106) are closer to the specific
resources 108 and may not allow grid schedulers 102 direct access to the resources. Examples of
compute resources include data storage devices such as hard drives and computer processors. The
grid level scheduler 102 typically does not own or control the actual resources. Therefore, jobs are
submitted from the high level grid scheduler 102 to a local set of resources with no more
permissions than the user would have. This reduces efficiencies.

[0009] Another issue that reduces efficiency is the heterogeneous nature of the shared resources.
Without dedicated access to a resource, the grid level scheduler 102 is challenged with the high
degree of variance and unpredictability in the capacity of the resources available for use. Most
resources are shared among users and projects and each project varies from the other.
[0010] Furthermore, the performance goals for projects differ. Grid resources are used to improve
performance of an application but the resource owners and users have different performance goals:
from optimizing the performance for a single application to getting the best system throughput or
minimizing response time. Local policies may also play a role in performance.
[0011] A bottleneck in the middleware component of the cluster or grid environment is the
difficulty in communicating with and managing the diverse compute resources. If the clusters 110 and
112 include a group of compute resources that have diverse communication protocols and
requirements across different resource managers 106A, 106B, 106C, then a cluster manager or
company establishing or provisioning a cluster may need to purchase or license multiple resource
managers to communicate with and assign jobs to all the compute resources.
[0012] Figure 2 illustrates the prior art in more detail. Typically the cluster scheduler 104A
communicates with only one resource manager server 106A. The resource manager server 106A
retrieves data from a number of resource manager clients 120A, 120B, 120C that are located on
each of the compute hosts 122A, 122B, 122C within a respective cluster node 108A, 108B, 108C.
The clients 120A, 120B, 120C obtain information regarding the state of the node 108A, 108B, 108C,
the load on the node, the configuration of the node, and similar properties.
[0013] This model is acceptable for simple clusters but it has a number of problems. A primary
problem in this scenario is that the cluster scheduler's view of the compute resources is bound by
the resource manager server 106A and it is not able to obtain any information that is not contained
within the resource manager 106A. The cluster scheduler 104A is not able to take any action that is
not within the scope of the communication and control capabilities of the resource manager 106A.
An end user or a customer site is bound to select a single resource manager and is
unable to pick and choose one resource manager combined with the features of another. The
standard resource manager 106A is designed on a locked-in model and purchasers must simply find
the resource manager with the best combination of features and purchase it, or look to purchasing
and incorporating multiple and diverse resource managers to meet all their needs.
[0014] What is needed in the art is a way of allowing a scheduler to contact multiple sources of
information and multiple sources of control so that an end user site can pick and choose which
groupings of services to utilize from multiple sources. What is further needed in the art is an
improved method and system for managing the disparate compute resources at a single location
within the context of a cluster environment where there are a variety of cluster resource managers
and a variety of compute resources.

SUMMARY OF THE INVENTION 

[0015] Additional features and advantages of the invention will be set forth in the description
which follows, and in part will be obvious from the description, or may be learned by practice of the
invention. The features and advantages of the invention may be realized and obtained by means of
the instruments and combinations particularly pointed out in the appended claims. These and other
features of the present invention will become more fully apparent from the following description
and appended claims, or may be learned by the practice of the invention as set forth herein.
[0016] The present invention addresses the need in the art for a workload manager that can obtain
configuration and state data and can control the state and activities of multiple resources in a cluster
from multiple services. A workload manager utilizes the services of a resource manager to obtain
information about the state of compute resources (nodes) and workload (jobs). The workload
manager can also use the resource manager to manage jobs, passing instructions regarding when,
where, and how to start or otherwise manipulate jobs. Using a local queue, jobs may be migrated
from one resource manager to another.

[0017] The invention comprises systems, methods and computer-readable media for providing
multiple-resource management of a cluster environment. The method embodiment of the
invention comprises, at a cluster or grid scheduler, defining a resource management interface,
identifying a location of a plurality of services within the cluster environment, determining a set of
services available from each of the plurality of resource managers, selecting a group of services
available from the plurality of resource managers, contacting the group of services to obtain full
information associated with the compute environment and integrating the obtained full
information into a single cohesive world-view of compute resources and workload requests.

BRIEF DESCRIPTION OF THE DRAWINGS 

[0018] In order to describe the manner in which the above-recited and other advantages and
features of the invention can be obtained, a more particular description of the invention briefly
described above will be rendered by reference to specific embodiments thereof which are illustrated
in the appended drawings. Understanding that these drawings depict only typical embodiments of
the invention and are not therefore to be considered to be limiting of its scope, the invention will be
described and explained with additional specificity and detail through the use of the accompanying
drawings in which:

[0019] FIG. 1 illustrates generally a grid scheduler, cluster scheduler, and resource managers
interacting with compute nodes;

[0020] FIG. 2 illustrates in more detail a single cluster scheduler communicating with a single
resource manager;

[0021] FIG. 3 illustrates an architectural embodiment of the invention;
[0022] FIG. 4A illustrates a method embodiment of the invention;
[0023] FIG. 4B illustrates another method aspect of the invention;

Attorney Docket: 010-0013

[0024] FIG. 5 illustrates a scheduler communicating with a plurality of compute resource managers
managing a cluster; and

[0025] FIG. 6 illustrates a scheduler communicating with a plurality of compute resource managers
each managing a separate cluster that comprises a grid.

DETAILED DESCRIPTION OF THE INVENTION 

[0026] Various embodiments of the invention are discussed in detail below. While specific
implementations are discussed, it should be understood that this is done for illustration purposes
only. A person skilled in the relevant art will recognize that other components and configurations
may be used without parting from the spirit and scope of the invention.

[0027] The present invention addresses the deficiencies in the prior art and provides systems and
methods for querying and controlling state and configuration information for cluster resources.
The "system" embodiment of the invention may comprise a computing device that includes the
necessary hardware and software components to enable a workload manager or a software module
performing the steps of the invention. Such a computing device may include such known hardware
elements as one or more central processors, random access memory (RAM), read-only memory
(ROM), storage devices such as hard disks, communication means such as a modem or a card to
enable networking with other computing devices, a bus that provides data transmission between
various hardware components, a keyboard, a display, an operating system and so forth. There is no
restriction that the particular system embodiment of the invention has any specific hardware
components and any known or future developed hardware configurations are contemplated as
within the scope of the invention when the computing device operates as is claimed. The systems
aspect of the invention may also comprise clusters, grids, servers, utility based computing centers,
hosting centers, data centers and workload managers or schedulers that perform the steps of the
invention.

[0028] A cluster scheduler using the principles of this invention is not restricted to any particular
resource manager. The term cluster scheduler may refer to a workload manager, a grid scheduler, or
any scheduling or management component in any layer within a grid or cluster system. Preferably,
the cluster scheduler is feature 302 of FIG. 3, but the functionality of the scheduler may operate at
other layers as well.

[0029] The scheduler can receive and process multiple sources of information for more intelligent
resource management and scheduling. Within cluster, grid or utility-based computing
environments, decisions on what resources to use and when to use them are often determined by
information provided by a compute resource manager such as Loadleveler, LSF, OpenPBS,
TORQUE, PBSPro, BProc and others which provide basic status about a node and which can be
used to submit job requests to a compute node. Information on each of these products and other
resource managers may be obtained from the Internet or from the vendors. The present invention
enables support for multiple standard resource manager interface protocols based on these different
products. Those of skill in the art will recognize how these compute resource managers interact
with the compute resources and cluster or grid schedulers.

[0030] Other benefits that are available through use of the present invention include support for
multiple simultaneous resource managers. A scheduler using the principles of the present invention
can integrate resource and workload streams from multiple independent sources reporting disjoint
sets of resources. The scheduler according to an aspect of the invention can allow one system to
manage the workload (queue manager) and another to manage the resources. The scheduler can support
rapid development interfaces that load resource and workload information obtained directly from a
file, a URL, or from the output of a configurable script or other executable. The scheduler can also
provide resource extension information by integrating information from multiple sources to obtain
a cohesive view of a compute resource (i.e., mix information from NIM, OpenPBS, FLEXlm, and a
cluster performance monitor to obtain a single node image with a coordinated state and a more
extensive list of node configuration and utilization attributes). The invention enables support for
generic resource manager interfaces to manage cluster resources securely via locally developed or
open source projects using simple flat text interfaces or XML over HTTP. These benefits and
advantages are not considered limitations on the principles of the invention unless expressly set
forth in the claim set below.
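
By way of illustration only, the integration of information from multiple sources into a single node image may be sketched as follows in Python. The source names, the attribute names, and the rule that the first source to report an attribute takes precedence are assumptions made for this sketch and are not taken from the disclosure.

```python
from collections import defaultdict


def merge_node_reports(reports):
    """Merge per-source node reports into one attribute dict per node.

    `reports` maps a source name to {node_name: {attribute: value}}.
    Assumed rule: the first source to report an attribute wins.
    """
    nodes = defaultdict(dict)
    for source, node_table in reports.items():
        for node_name, attrs in node_table.items():
            for key, value in attrs.items():
                nodes[node_name].setdefault(key, value)
    return dict(nodes)


# Hypothetical reports from three independent sources.
reports = {
    "resource_manager": {"node01": {"state": "idle", "procs": 4}},
    "license_manager": {"node01": {"licenses": 2}},
    "perf_monitor": {
        "node01": {"load": 0.15, "state": "busy"},
        "node02": {"load": 0.90},
    },
}
world_view = merge_node_reports(reports)
```

In this sketch node01 ends up with attributes contributed by all three sources, while node02, which only the performance monitor reports, still appears in the merged view.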

[0031] Compute resource managers like those introduced above are able to provide node state
information to help make workload decisions. Additional resource managers may collect
additional types of information, such as attributes and status of storage resources, software
licenses, network bandwidth, file sizes, or any number of data points about any type of resource,
whether consumable or non-consumable, and this information can be incorporated into resource
policy and scheduling decisions by a given tool. Resource managers may be formally known as such or they
may be as simple as a one or two-line script designed to gather a particular piece of information.
The ability to manage multiple types of resources simultaneously to make more intelligent policy
and scheduling decisions as well as report additional information is one aspect of multi-resource
management.
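
As a hedged example of a resource manager that is no more than a short script, the following Python sketch runs an external command and parses its output. The "name key=value" line format and the attribute names are assumptions chosen for illustration; the disclosure does not define an output format.

```python
import subprocess
import sys


def query_script(command):
    """Run a command and parse lines of the form 'name key=value ...'."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    info = {}
    for line in result.stdout.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        name, pairs = fields[0], fields[1:]
        info[name] = dict(pair.split("=", 1) for pair in pairs)
    return info


# Stand-in for a site's one-line information-gathering script.
script = [sys.executable, "-c",
          "print('node01 scratch_gb=120 bandwidth_mbps=1000')"]
storage_info = query_script(script)
```

Values are left as strings here; a real interface would presumably convert them to typed attributes before handing them to the scheduler.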

[0032] With reference to FIGS. 3 and 4, the basic principles of the system and method
embodiments of the invention are introduced. In the context of this disclosure, each of the
services 306, 308, 310, 312 and 314 may be referred to generally as a "resource manager", or more
specifically according to the particular service they provide, such as a provisioning manager 306 that
manages provisioning-related processes on the cluster or a storage manager 312 that communicates
with and manages data storage 320. One aspect of the invention comprises, within a cluster
scheduler 302, defining a resource management interface (402).

[0033] The cluster scheduler 302 must interface with numerous resource management systems 310.
Some of the resource managers 310 interact through a resource manager specific interface (i.e.,
OpenPBS/PBSPro, Loadleveler, SGE) while others interact through generalized interfaces such as
SSS or Wiki. Those of skill in the art will understand these generalized interfaces and no more
detail is provided herein.

[0034] For most resource managers, either route is possible depending on where it is easiest to
focus development effort. Use of Wiki generally requires modifications on the resource manager
side while creation of a new resource manager specific interface would require more changes to
modules in the cluster scheduler. If a scheduling API already exists within the resource manager,
creation of a resource manager specific scheduler interface is often selected.
[0035] The specific interfaces for several resource managers are discussed next to provide an
example of how to define the resource management interface. If the resource manager specific
interface is desired, then typically a scheduling API library/header file combination is desirable (i.e., for
PBS, libpbs.a + pbs_ifl.h, etc.). This resource manager-provided API provides calls which can be
linked into the workload manager 302 to obtain the raw resource manager data including both jobs
and compute nodes. Additionally, this API should provide policy information about the resource
manager configuration if it is desired that such policies be specified via the resource manager rather
than the scheduler and that the workload manager 302 will know of and respect these policies.
[0036] A module (such as the 'M<X>I.c' module) is responsible for loading information from the
resource manager, translating this information, and then populating the appropriate scheduler data
structures. The existing modules (such as the MLLI.c, MPBSI.c and MWikiI.c modules) provide
templates indicating how to do this. These may be obtained from the Internet or the respective
vendor.

[0037] There are two steps associated with configuring an interface in a scheduler 302. The first
step involves mentioning or identifying the location of various services in the environment (404)
and a port and a protocol being used. The cluster environment may be at least one of a local area
grid, data centers, wide area grid, cluster scheduler utility based computing environment and hosted
centers. The host, port and server attributes can be used to specify how the resource manager 310
should be contacted. For many resource managers (i.e., OpenPBS, PBSPro, Loadleveler, SGE,
LSF, etc.) the interface correctly establishes contact using default values. These parameters need
only be specified for resource managers such as the WIKI interface (which do not include
defaults) or for resource managers which can be configured to run at non-standard locations
(such as PBS). In all other cases, the resource manager is automatically located.
[0038] The maximum amount of time the scheduler 302 will wait on a resource manager 310 call
can be controlled by a timeout parameter which defaults to 30 seconds or any other desired amount.
The authtype attribute allows specification of how security over the scheduler/resource manager
interface is to be handled.

[0039] Another resource manager configuration attribute is CONFIGFILE, which specifies the
location of the resource manager's primary config file and is used when detailed resource manager
information not available via the scheduling interface is required. The NMPORT attribute allows
specification of the resource manager's node manager port and is only required when this port has
been set to a non-default value.
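
The configuration attributes discussed above (host, port, timeout, authtype, CONFIGFILE and NMPORT) may be pictured, purely as a sketch, as a small Python record. The class name, the field types, and the helper method are hypothetical; only the attribute names and the 30 second default timeout come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResourceManagerConfig:
    """Hypothetical per-resource-manager interface settings."""
    name: str
    type: str                          # e.g. "PBS" or "WIKI"
    host: Optional[str] = None         # None: rely on the interface default
    port: Optional[int] = None         # None: rely on the interface default
    timeout: int = 30                  # seconds to wait on a resource manager call
    authtype: Optional[str] = None     # how interface security is handled
    configfile: Optional[str] = None   # resource manager's primary config file
    nmport: Optional[int] = None       # node manager port, if non-default

    def needs_explicit_location(self):
        """WIKI has no defaults, so host and port must be given (assumed rule)."""
        return self.type == "WIKI" and (self.host is None or self.port is None)


pbs = ResourceManagerConfig(name="base", type="PBS")
wiki = ResourceManagerConfig(name="custom", type="WIKI")
```

Here the PBS entry can be left at its defaults, while the WIKI entry is flagged as incomplete until a host and port are supplied.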

[0040] FIG. 3 shows an example of various services, such as a provisioning manager 306, a node
manager 308, the resource manager 310, a storage manager 312 and a network manager 314. Other
services not shown are also contemplated, such as software licensing managers or script gathering
managers. A second step is defining a type for each service (406). If the service under
consideration is the resource manager 310, then the type may be the vendor or which resource
manager is managing the cluster. Defining the type in this regard is accomplished by modifying a
header file (such as moab.h) and a source code file (such as the MConst.c file) to define a new
RMTYPE parameter value. For example, the resource manager type may be PBSPro, TORQUE or
others. With this defined, a module (such as the MRMI.c module) must be modified to call the
appropriate resource manager specific calls which will eventually be created within the 'M<X>I.c'
module. This process is straightforward and involves extending existing resource manager specific
case statements within the general resource manager calls.
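
The type-based dispatch described above, in which a general resource manager call selects the resource manager specific call for the configured type, can be approximated in Python with a lookup table. This is an analogue of the case statements mentioned in the text and not the actual source; the function names and returned job identifiers are placeholders.

```python
def pbs_get_jobs():
    """Placeholder for a PBS-specific job query."""
    return ["pbs-job-1"]


def wiki_get_jobs():
    """Placeholder for a Wiki-specific job query."""
    return ["wiki-job-1"]


# One entry per supported resource manager type; adding a type means
# adding an entry, much like extending a case statement.
GET_JOBS_BY_TYPE = {
    "PBS": pbs_get_jobs,
    "WIKI": wiki_get_jobs,
}


def rm_get_jobs(rm_type):
    """General call that forwards to the type-specific implementation."""
    try:
        handler = GET_JOBS_BY_TYPE[rm_type]
    except KeyError:
        raise ValueError(f"unsupported resource manager type: {rm_type}")
    return handler()
```

Calling `rm_get_jobs("PBS")` forwards to the PBS placeholder, and an unknown type raises `ValueError` rather than failing silently.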

[0041] The resource manager specific data collection and job management calls play an important
role in the cluster scheduler 302. These calls populate data structures and are responsible for
passing scheduler 302 scheduling commands on to the resource manager 310. The base commands
are GetJobs, GetNodes, StartJob, and CancelJob but if the resource manager support is available,
extended functionality can be enabled by creating commands to suspend/resume jobs,
checkpoint/restart jobs, and/or allow support of dynamic jobs.

[0042] If the resource manager 310 provides a form of event driven scheduling interface, this
feature will also need to be enabled. A module (such as the MPBSI.c module) provides a template
for enabling such an interface within the MPBSProcessEvent() call.

[0043] The Wiki interface is a good alternative if the resource manager does not already support
some form of existing scheduling API. For the most part, use of this API requires the same
amount of effort as creating a resource manager specific interface but development effort is focused
within the resource manager. Since Wiki is already defined as a resource manager type, no
modifications are required within the cluster scheduler 302. Additionally, no resource manager
specific library or header file is required. However, within the resource manager 310, internal job
and node objects and attributes must be manipulated and placed within Wiki-based interface
concepts as defined by the interface. Additionally, resource manager parameters must be created to
allow a site to configure this interface appropriately.

[0044] The SSS interface is an XML-based generalized resource manager interface. It provides an
extensible, scalable, and secure method of querying and modifying general workload and resource
information.

[0045] Defining the type of service may also mean, in a more general sense, which "type" of service
is being analyzed. For example, in FIG. 3, there is a provisioning manager 306, node monitor 308,
resource manager 310, storage manager 312 and network manager 314. The type may indicate the
kind of service or monitoring that the particular service provides and any associated data.
[0046] A type scopes the information coming back to the scheduler 302 via the resource manager
310, and it also scopes, or constrains, the types of functionality that can be
provided by the resource manager 310 to the scheduler, within that resource manager object.
[0047] Another aspect of identifying the type relates to what type of service is being identified and
categorized, such as a provisioning service, network manager, licensing service, and so forth.
[0048] The steps of defining a resource management interface and identifying locations of a
plurality of services (404) and defining types (406) can either be performed manually by a network
administrator or automatically by communicating between the scheduler and a directory service
server 304. In the automatic model, each resource manager 310 and each service 306, 308, 312 and
314 reports its configuration and capabilities to the directory service 304 instead of
being configured in the scheduler 302. The scheduler 302 communicates with the directory service
304 and determines the types of service that the directory service has detected and identified. The
scheduler then selects the desired service and the directory service 304 contacts the services 306,
308, 310, 312 and/or 314 which can enable the scheduler 302 to schedule and manage the cluster.
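
The automatic model may be sketched, under stated assumptions, as a registry that services report to and that the scheduler then queries. The class, its method names, the host names and the port numbers below are all hypothetical and serve only to illustrate the flow of reporting, type discovery and selection.

```python
class DirectoryService:
    """Toy directory that records what each service reports about itself."""

    def __init__(self):
        self._services = {}

    def register(self, name, service_type, host, port):
        """Called by a service to report its type and location."""
        self._services[name] = {"type": service_type, "host": host, "port": port}

    def types(self):
        """Service types the directory has detected, sorted for stable output."""
        return sorted({entry["type"] for entry in self._services.values()})

    def lookup(self, service_type):
        """All registered services of the requested type."""
        return {name: entry for name, entry in self._services.items()
                if entry["type"] == service_type}


directory = DirectoryService()
directory.register("pbs-main", "resource_manager", "rm.cluster.example", 15001)
directory.register("store-1", "storage_manager", "store.cluster.example", 15002)

available_types = directory.types()             # what the scheduler can choose from
selected = directory.lookup("storage_manager")  # the scheduler's selection
```

With this arrangement nothing about the individual services has to be configured in the scheduler itself; it only needs to know where the directory is.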
[0049] The scheduler 302 interacts with all resource managers 310 (and other services) using a
common set of commands and objects. In the simplest configuration, the four primary functions
are GETJOBINFO, which collects detailed state and requirement information about idle, running, and
recently completed jobs; GETNODEINFO, which collects detailed state information about idle,
busy, and defined nodes; STARTJOB, which immediately starts a specific job on a particular set of
nodes; and CANCELJOB, which immediately cancels a specific job regardless of job state. Using these
four commands, the scheduler 302 enables an entire suite of scheduling functions. In addition to
these base commands, other commands may be utilized to support features such as dynamic job
support, suspend/resume, gang scheduling, and scheduler-initiated checkpoint restart.
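
The four base commands can be pictured as a common interface that every resource manager backend implements. The following Python sketch is illustrative only: the class names, method signatures and the in-memory backend are assumptions, with the method names chosen to mirror GETJOBINFO, GETNODEINFO, STARTJOB and CANCELJOB.

```python
from abc import ABC, abstractmethod


class ResourceManagerInterface(ABC):
    """Hypothetical common interface mirroring the four base commands."""

    @abstractmethod
    def get_job_info(self):
        """GETJOBINFO: state and requirement information about jobs."""

    @abstractmethod
    def get_node_info(self):
        """GETNODEINFO: state information about nodes."""

    @abstractmethod
    def start_job(self, job_id, nodes):
        """STARTJOB: start a job on a particular set of nodes."""

    @abstractmethod
    def cancel_job(self, job_id):
        """CANCELJOB: cancel a job regardless of its state."""


class InMemoryResourceManager(ResourceManagerInterface):
    """Toy backend used only to exercise the interface."""

    def __init__(self, jobs, nodes):
        self.jobs = {job: "idle" for job in jobs}
        self.nodes = {node: "idle" for node in nodes}
        self.allocations = {}

    def get_job_info(self):
        return dict(self.jobs)

    def get_node_info(self):
        return dict(self.nodes)

    def start_job(self, job_id, nodes):
        self.jobs[job_id] = "running"
        self.allocations[job_id] = list(nodes)
        for node in nodes:
            self.nodes[node] = "busy"

    def cancel_job(self, job_id):
        self.jobs[job_id] = "cancelled"
        # Release any nodes the job was holding.
        for node in self.allocations.pop(job_id, []):
            self.nodes[node] = "idle"


rm = InMemoryResourceManager(jobs=["job1"], nodes=["node01", "node02"])
rm.start_job("job1", ["node01"])
running_view = (rm.get_job_info(), rm.get_node_info())
rm.cancel_job("job1")
```

Because the scheduler would only ever call the four interface methods, a vendor-specific backend could be swapped in without changing scheduling logic.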
[0050] Each resource manager interface obtains and translates scheduler 302 concepts regarding
workload and resources into native resource manager objects, attributes and commands.
[0051] The directory server 304 approach is an automatic method of communicating with each of the
resource managers. In a simple configuration the manual approach is preferable; in a more
complex configuration or a more dynamic situation, the directory server 304 method is
preferable.

[0052] Once all the resource managers 306, 308, 310, 312 and 314 are configured, the scheduler
302 proceeds to obtain a full world view by pulling in information and workload requests from the
various resource managers 306, 308, 310, 312 and 314 (412) and then performs a plurality of tasks
based on policies and needs (414). Queue 322 illustrates a job queue with a workload that may be
submitted at the cluster scheduler 302 layer, the grid scheduler layer 102 or even to the resource
manager 310. An example of a job submitted for processing on the cluster is a weather analysis on
an upcoming tropical storm. A weather bureau may schedule the cluster every night at midnight for
a large amount of processing resources to compute their weather analysis. The jobs may have
policies and needs related to processing power, a required time to complete the job (for reporting
the news at a certain time), and so forth. If the job is submitted to the scheduler 302, the
scheduler 302 retrieves information from the various resource managers 306, 308, 310, 312, 314 and
checks the queue 322, gathers all its information, and then instructs the resource manager 310 to
process the job according to the direction from the scheduler 302 based on all the received
information, policies and permissions. This processing may include re-provisioning the cluster
environment or modifying the workload to meet the capabilities of the resources.
[0053] FIG. 4B illustrates another aspect of the method embodiment of the invention. In this
aspect, a method of providing multiple-source resource management of a compute environment
comprises defining a resource management interface (420), identifying a location of each of a
plurality of resource managers within the compute environment (422), determining a set of services
available from each of the plurality of resource managers (424), selecting a group of services
available from the plurality of resource managers (426), contacting the group of resource managers
to obtain full information associated with the compute environment (428) and integrating the
obtained full information into a single cohesive world-view of compute resources and workload
requests (430).
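
The steps of FIG. 4B can be traced in a compact Python sketch. Everything concrete in it is an assumption made for illustration: the dictionary layout standing in for the resource management interface (420), the host names and ports standing in for locations (422), the service names, and the query functions that return canned reports.

```python
def rm_query(service):
    """Hypothetical resource manager: reports jobs or nodes on request."""
    if service == "workload":
        return {"jobs": {"job1": {"state": "idle"}}}
    return {"nodes": {"node01": {"state": "idle"}}}


def storage_query(service):
    """Hypothetical storage manager: reports storage attributes per node."""
    return {"nodes": {"node01": {"scratch_gb": 120}}}


# (420)/(422): a uniform record per manager, including its location.
managers = {
    "rm": {"location": ("rm-host", 15001),
           "services": ["workload", "nodes"], "query": rm_query},
    "storage": {"location": ("storage-host", 15002),
                "services": ["storage"], "query": storage_query},
}


def build_world_view(managers, wanted_services):
    # (424) determine the set of services each manager offers
    offered = {name: set(m["services"]) for name, m in managers.items()}
    # (426) select the group of services to use
    selected = {name: svcs & wanted_services for name, svcs in offered.items()}
    # (428) contact the selected services and (430) integrate the results
    view = {"nodes": {}, "jobs": {}}
    for name, services in selected.items():
        for service in sorted(services):
            report = managers[name]["query"](service)
            for kind in ("nodes", "jobs"):
                for obj, attrs in report.get(kind, {}).items():
                    view[kind].setdefault(obj, {}).update(attrs)
    return view


view = build_world_view(managers, {"workload", "nodes", "storage"})
```

The resulting view holds one entry for node01 that combines its scheduling state with its storage attribute, alongside the job reported by the resource manager.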

[0054] In the context above, determining a set of "services" available from each of the plurality of
resource managers may also refer to determining a set of services and/or data available from each
of a plurality of resource managers. As a system performs the steps set forth herein, each of the
steps may be considered as being practiced by a "module" such as computer software code that is
programmed to generally perform the recited step.

[0055] Many of the services in FIG. 3 can modify the cluster to meet the needs of the workload
322. The provisioning manager 306 can change operating systems and software applications on
nodes in the cluster. A node monitor 308 provides information about each node 108A, 108B,
108C, 108D upon a query. It simply returns the information as a query-only type object. The
storage manager 312 will connect file systems that are required for a job.

[0056] For example, suppose a job requires a massive data set for storage and nodes 108A, 108B
are allocated for the job. The storage manager 312 will associate the storage that holds
the images needed with the assigned nodes 108A, 108B for that job. The network manager 314 can,
for example, provide a virtual private network for security and a guaranteed bandwidth via a router
316 for inter-process communication. The network manager 314 communicates with the router
316 and the router 316 communicates with the nodes 108A, 108B, 108C, 108D in the cluster. The
architecture in FIG. 3 is typically a single cluster unless the context is within a virtual private cluster,
in which case it could simply be a partition within the larger cluster.

[0057] Table 1 illustrates example resource manager commands that relate to the interaction of the
scheduler 302 with the resource manager 310.
Table 1

Object   Function         Details
------   --------------   ----------------------------------------------------------------
Job      Query            Collect detailed state, requirement, and utilization information
                          about jobs
Job      Modify           Change job state and/or attributes
Job      Start            Execute a job on a specified set of resources
Job      Cancel           Cancel an existing job
Job      Preempt/Resume   Suspend, Resume, Checkpoint, Restart, or Requeue a job
Node     Query            Collect detailed state, configuration, and utilization information
                          about compute resources
Node     Modify           Change node state and/or attributes
Queue    Query            Collect detailed policy and configuration information from the
                          resource manager

[0058] Using these functions, the scheduler 302 is able to fully manage workload, resources, and
cluster policies. Beyond these base functions, other commands exist to support advanced features
such as dynamic job support, provisioning, and cluster level resource management.

[0059] In general, the scheduler 302 interacts with resource managers 310 in a sequence of steps in
each scheduling iteration. These steps comprise: loading global resource information, loading
node-specific information (optional), loading job information, loading queue/policy information
(optional), canceling, preempting, or modifying jobs according to cluster policies, starting jobs in
accordance with available resources and policy constraints, and handling user commands. Typically,
each step completes before the next step is started. However, with current systems, size and
complexity mandate a more advanced parallel approach providing benefits in the areas of reliability,
concurrency, and responsiveness.
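By way of illustration only, the serial form of a scheduling iteration described above might be sketched as follows. The class FakeResourceManager, its method names, and the first-fit pairing of queued jobs with idle nodes are assumptions made for this sketch; they are not taken from the disclosure and do not represent any particular resource manager API.

```python
from dataclasses import dataclass, field


@dataclass
class FakeResourceManager:
    """Stand-in resource manager; every name here is illustrative only."""
    nodes: dict = field(default_factory=lambda: {"n1": "idle", "n2": "idle"})
    jobs: dict = field(
        default_factory=lambda: {"j1": "queued", "j2": "queued", "j3": "queued"}
    )

    def query_nodes(self):
        return dict(self.nodes)

    def query_jobs(self):
        return dict(self.jobs)

    def start_job(self, job_id, node_id):
        self.jobs[job_id] = "running"
        self.nodes[node_id] = "busy"


def run_iteration(rm):
    """Run one serial iteration; each step completes before the next begins."""
    log = []
    nodes = rm.query_nodes()              # load resource information
    log.append("load_nodes")
    jobs = rm.query_jobs()                # load job information
    log.append("load_jobs")
    idle = [n for n, state in sorted(nodes.items()) if state == "idle"]
    queued = [j for j, state in sorted(jobs.items()) if state == "queued"]
    for job_id, node_id in zip(queued, idle):   # start jobs on free resources
        rm.start_job(job_id, node_id)
        log.append(f"start:{job_id}@{node_id}")
    log.append("handle_user_commands")    # user commands are served last
    return log
```

Because every step blocks the next, a slow or failed query in this form delays job starts and user commands alike, which is the limitation the threaded approach addresses.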

[0060] A number of the resource managers that Moab interfaces with were unreliable to some extent.
This resulted in calls to resource management APIs which exited or crashed, taking the entire
scheduler with them. Use of a threaded approach would cause only the calling thread to fail,
allowing the master scheduling thread to recover. Additionally, a number of resource manager calls
would hang indefinitely, locking up the scheduler. These hangs could likewise be detected by the
master scheduling thread and handled appropriately in a threaded environment.
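The isolation described above can be approximated with a worker thread and a bounded wait. The following is a minimal sketch under stated assumptions: the function name call_with_isolation and the status strings are hypothetical, and a Python exception merely stands in for a failing API call (a crash inside native library code would need process-level isolation instead).

```python
import threading
import time


def call_with_isolation(func, timeout_s):
    """Run func in a worker thread; report a failure or hang to the caller."""
    result = {}

    def worker():
        try:
            result["value"] = func()
        except Exception as exc:       # a failing call ends only this worker
            result["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)             # the master thread waits a bounded time
    if thread.is_alive():
        return ("hung", None)          # the call did not return in time
    if "error" in result:
        return ("failed", str(result["error"]))
    return ("ok", result["value"])
```

The master thread remains free to retry, skip the resource manager for this iteration, or mark it as down, depending on the returned status.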
[0061] As resource managers grew in size, the duration of each API global query call grew
proportionally. Particularly, queries which required contact with each node individually became
excessive as systems grew into the thousands of nodes. A threaded interface allows the scheduler
302 to concurrently issue multiple node queries resulting in much quicker aggregate RM query
times.
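As a rough sketch of issuing node queries concurrently, the example below uses a thread pool from the Python standard library. The function query_node is a hypothetical placeholder whose short sleep stands in for per-node network latency; the worker count is likewise an assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def query_node(node_id):
    """Hypothetical per-node query; the sleep stands in for network latency."""
    time.sleep(0.05)
    return {"node": node_id, "state": "idle"}


def query_all_nodes(node_ids, workers=8):
    """Issue node queries concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(query_node, node_ids))
```

With enough workers, the aggregate query time approaches the latency of the slowest single node rather than the sum over all nodes.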

[0062] In the non-threaded serial approach, the user interface is blocked while the scheduler
updates various aspects of its workload, resource, and queue state. In a threaded model, the
scheduler could continue to respond to queries and other commands even while fresh resource
manager state information was being loaded, resulting in much shorter average response times for
user commands.

[0063] Under the threaded interface, all resource manager information is loaded and processed
while the user interface is still active. Average aggregate resource manager API query times are
tracked and new RM updates are launched so that the RM query will complete before the next
scheduling iteration should start. Where needed, the loading process uses a pool of worker threads
to issue large numbers of node specific information queries concurrently to accelerate this process.
The master thread continues to respond to user commands until all needed resource manager
information is loaded and either a scheduling relevant event has occurred or the scheduling iteration
time has arrived. At this point, the updated information is integrated into the scheduler's 302 state
information and scheduling is performed.
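The timing rule described above, launching an RM update early enough that it should finish before the next iteration, can be illustrated as follows. The class name QueryTimeTracker, the use of a simple running mean, and the optional safety margin are assumptions for this sketch; the disclosure does not specify how the average is computed.

```python
class QueryTimeTracker:
    """Tracks a running average of aggregate RM query durations in seconds."""

    def __init__(self):
        self.count = 0
        self.average = 0.0

    def record(self, duration_s):
        """Fold one observed query duration into the running mean."""
        self.count += 1
        self.average += (duration_s - self.average) / self.count

    def launch_time(self, next_iteration_at, margin_s=0.0):
        """Latest start time at which an update should still finish in time."""
        return next_iteration_at - self.average - margin_s
```

For example, with observed durations of 4, 6, and 8 seconds the average is 6 seconds, so an update for an iteration due at t = 60 with a 2 second margin would be launched at t = 52.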

[0064] The present invention allows distribution of data probes and distribution of activators and
cluster functionality such that one can have piecemeal upgrades to the cluster environment and
piecemeal improvements without having to remove and replace the entire system. Other
advantages include the capability of bringing in non-standard cluster resources. An administrator
can define new resources in a matter of minutes and have those resources manipulated and
scheduled to receive jobs. The administrator can add new network management directly for a
router, regardless of what the resource manager 310 supports, and that network management service
can be incorporated into the overall cluster management scheme.

[0065] The administrator can pick and choose the components provided by various vendors or
open source solutions and get exactly what is needed. The components do not need to be vendor
specific to communicate with a particular proprietary resource manager 310. The present invention
enables a free cluster management environment and, if there are pieces of information that need to
be provided by a particular service or resource scheduler, the present invention utilizes that
information to seamlessly incorporate the various services and resources for control from a single
location.

[0066] Examples of various resource management vendors or products include provisioning
management services such as Red Carpet, Cluster Systems Management (CSM) from IBM, and
open source tools or multiple other automatic node configuration management systems. Such
available services may also be connected directly in with RAM root and an NFS system to simply
reboot (and nothing more). Services such as the provisioning manager 306 could either be one of
those tools mentioned above or it could be five or ten lines of Perl code that has the same
effective action.

[0067] Another advantage to the multi-resource management invention is that many of these
services can be written by hand or written in scripts in a few hours to meet specific needs and thus
enable new functionality and new policies. For example, the flexibility of the present invention
enabled a network manager created from one line of source code to retrieve additional information
about network load that is pertinent to scheduling. That information is made available to the
scheduler according to the principles of the present invention but would not be available to other
resource managers.

[0068] With a storage manager 312, each of the major storage systems 320 provides some level of
storage management but in the prior art these are not integrated into any resource manager 310.
Therefore, if an administrator wanted storage management, there was no previous manner of
performing that task via a resource manager. With the model disclosed herein, by contrast, the
administrator can obtain a storage manager 312 from the storage management vendor, interface
with this tool, and allow the scheduler 302 to intelligently manage the data and the storage
right along with the submitted jobs.

[0069] The scheduler 302 can create a resource manager by simply writing a script and having it
executed, by loading that information straight into a file, by having the information put into an SQL
database, or even by making the information available via a web service and having the scheduler 302
obtain the information directly over the Internet.
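To illustrate how little is needed for such a script-based resource manager, the sketch below parses a plain-text node report into per-node attributes. The report format (one node per line, followed by key=value pairs) and the function name parse_node_report are assumptions made for this example; the disclosure does not define a specific format. The same text could equally be read from a script's output, a flat file, or a web service response.

```python
def parse_node_report(text):
    """Parse lines such as 'node01 state=idle cpuload=0.25' into a dict per node."""
    nodes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                    # skip blank lines and comments
        name, *pairs = line.split()
        attrs = {}
        for pair in pairs:
            key, _, value = pair.partition("=")
            attrs[key] = value          # values are kept as strings
        nodes[name] = attrs
    return nodes
```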
[0070] Another aspect of the present invention relates to on-demand services. In the current state of
the art, when resources are being re-provisioned or reconfigured, certain services will be unavailable
for a portion of that time frame. For example, a resource management client 120A, 120B, 120C,
120D would be down on that respective node 108A, 108B, 108C, 108D while it is being
re-provisioned. With the scheduler disclosed herein having the multi-resource manager capability,
multiple sources can provide redundant information so that while one resource manager 310 is
down, another source of information reports that this node is being re-provisioned
and is moving forward on the desired path. This redundant information not only handles failures but
also handles those situations in which not all resource managers can report all information at all
times.
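One simple way to combine redundant reports is a priority-ordered merge in which a source contributes an attribute only if no earlier source has reported it. This is a sketch under assumptions: the function name merge_node_reports, the use of None to mean "not reported", and the earlier-source-wins rule are illustrative choices and are not specified in the disclosure.

```python
def merge_node_reports(reports):
    """Merge per-source node reports; earlier sources win, None means unreported."""
    merged = {}
    for report in reports:              # reports are ordered by priority
        for node, attrs in report.items():
            target = merged.setdefault(node, {})
            for key, value in attrs.items():
                if value is not None and key not in target:
                    target[key] = value
    return merged
```

In this form, a compute resource manager whose client is down during re-provisioning simply reports no state, and the state supplied by a provisioning service fills the gap.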

[0071] FIG. 3 includes nodes 108A, 108B, 108C and 108D which can be defined as a single cluster.
Another aspect of the multi-resource manager is how it operates on a multi-cluster platform. This
aspect is shown by way of example in FIG. 5. The present invention supports multiple resource
managers in order to increase the scalability of what a resource manager can manage within multiple
partitions of a very large cluster. Compute resource managers such as Loadleveler, LSF, OpenPBS,
TORQUE, PBSPro, etc., may have a scalability limitation, in that a single instance of a compute
resource manager may only be able to manage workload submission and node status tracking across
a limited number of compute nodes. For example, compute resource manager 504 can manage job
submission for up to 1,000 compute nodes. The customer has a 3,000 compute node cluster 510.
The customer may set up 3 separate instances 504, 506, 508 of the compute resource manager to
manage 1,000 compute nodes each. Then, using this aspect of multi-resource management, a tool is
used to manage workload, policies, scheduling, monitoring and reporting across all three compute
resource managers as though they were one. Thus a user 502 may submit a job that requires more
than 1,000 nodes to the scheduler 302, knowing that the scheduler will manage
the job correctly over the cluster, even given the limitations of the individual resource managers.
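The 3,000 node example above can be made concrete with a small allocation sketch. The greedy, in-order strategy and the names allocate_across_managers and rm504, rm506, rm508 are assumptions for illustration only; the disclosure does not prescribe how a job is divided among resource manager instances.

```python
def allocate_across_managers(requested, free_by_manager):
    """Split a node request over several managers; return None if it cannot fit."""
    if requested > sum(free_by_manager.values()):
        return None                     # not enough free nodes in the whole cluster
    allocation = {}
    remaining = requested
    for manager, free in free_by_manager.items():
        take = min(free, remaining)     # take as many nodes as this instance has
        if take:
            allocation[manager] = take
            remaining -= take
        if remaining == 0:
            break
    return allocation
```

A request for 1,500 nodes, which no single 1,000 node instance could satisfy, is here served by taking 1,000 nodes from the first instance and 500 from the second.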
[0072] Inquiries or requests that are placed from any one of the three portions of the overall
cluster, or directly to the centralized tool, can be applied to any one of the three portions of the
cluster no matter where they originated. Policies on the centralized tool, however, could set rules that
disallow work requests from one portion from being applied to another portion, or any number of
policies based on the origin and destination of the request.
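An origin and destination policy of the kind just described could be represented as a set of denied pairs. The representation and the function names below are hypothetical and serve only to show the idea.

```python
def is_allowed(origin, destination, denied_pairs):
    """Return True unless policy denies sending work from origin to destination."""
    return (origin, destination) not in denied_pairs


def eligible_destinations(origin, partitions, denied_pairs):
    """List the partitions to which a request from origin may be applied."""
    return [p for p in partitions if is_allowed(origin, p, denied_pairs)]
```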

[0073] Another aspect of the present invention is its support for multiple resource managers in
order to manage multiple clusters in a local area grid from one tool. Consequently, the resource
management configuration parameter (RMCFG) takes an index value, e.g., RMCFG[clusterA]
TYPE=PBS. This index value essentially names the resource manager (as done by the deprecated
parameter RMNAME). The resource manager name is used by the scheduler in diagnostic
displays, logging, and in reporting resource consumption to the allocation manager. For most
environments, the selection of the resource manager name can be arbitrary.
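As an illustration of how the index value names each resource manager, the sketch below reads lines of the form RMCFG[clusterA] TYPE=PBS into a mapping keyed by that name. The parser, the regular expression, and the second cluster name are assumptions for this example and are not the scheduler's actual configuration reader.

```python
import re

# Matches 'RMCFG[<name>] KEY=VALUE ...'; the layout is inferred from the example above.
RMCFG_LINE = re.compile(r"^RMCFG\[(?P<name>[^\]]+)\]\s+(?P<attrs>.*)$")


def parse_rmcfg(lines):
    """Map each resource manager name (the index value) to its attributes."""
    managers = {}
    for line in lines:
        match = RMCFG_LINE.match(line.strip())
        if not match:
            continue                    # ignore lines that are not RMCFG entries
        attrs = {}
        for pair in match.group("attrs").split():
            key, sep, value = pair.partition("=")
            if sep:                     # keep only well-formed KEY=VALUE pairs
                attrs[key] = value
        managers.setdefault(match.group("name"), {}).update(attrs)
    return managers
```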
[0074] This feature is illustrated in FIG. 6. When an organization has multiple clusters 604, 606,
608 and would like to use those clusters in a more aggregated manner, a tool may be used to
combine the management of resources on each cluster through its local compute resource manager
610, 612, 614. In this example, each resource manager is of the same type. An example of this
would be an organization that has 3 clusters at a similar location, each being used by a different
group. Each cluster 604, 606, 608 is using a respective compute resource manager 610, 612, 614.
The workload manager 302 of the present invention may be used to provide unified management of
all clusters and can allow any user from any one cluster to submit work to any other cluster, or to
submit the work to the local area grid (pool of all three clusters combined or a portion of all three
clusters) in which case the workload manager would make the decision as to where best to submit
the job request. This creates a local area grid 602.

[0075] Another aspect of the workload manager is its support for multiple resource managers of
different types for purposes of managing a heterogeneous set of clusters. This aspect is very similar
to the aspect described above with reference to FIG. 6, although this is done also for the
purpose of combining multiple types of compute resource managers. Therefore, in this example,
resource managers 610, 612, 614 are of different types to allow different groups that have an
investment in or an affinity to a particular resource manager to continue with that investment or
affinity while still providing a unified management mechanism through the present invention.
[0076] An issue that is raised with the capabilities of the present invention is how to handle
conflicting information regarding a node, job or policy. The scheduler 302 in this regard does not trust
resource manager 310 information. To obtain the most reliable information, node, job, and policy
information is reloaded on each iteration and discrepancies are detected. Synchronization issues
and allocation conflicts are logged and handled where possible. To assist sites in minimizing stale
information and conflicts, a number of policies and parameters are available: Node State
Synchronization Policies (see NODESYNCTIME); Job State Synchronization Policies (see
JOBSYNCTIME); Stale Data Purging (see JOBPURGETIME); Thread Management (preventing
resource manager failures from affecting scheduler operation); Resource Manager Poll Interval (see
RMPOLLINTERVAL); Node Query Refresh Rate (see NODEPOLLFREQUENCY).
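The per-iteration discrepancy detection described above can be sketched as a comparison between the state the scheduler expects and the state a resource manager reports. The function name find_discrepancies and the two issue labels are assumptions for illustration; how each discrepancy is then handled (for example, with a synchronization grace period) is left to policy.

```python
def find_discrepancies(expected, reported):
    """Compare expected and reported object state; list changed or missing objects."""
    issues = []
    for obj_id in sorted(set(expected) | set(reported)):
        old, new = expected.get(obj_id), reported.get(obj_id)
        if new is None:
            issues.append((obj_id, "missing", old, None))   # no longer reported
        elif old is not None and old != new:
            issues.append((obj_id, "changed", old, new))    # state disagrees
    return issues
```

Objects that appear only in the reported state are treated here as newly discovered rather than as conflicts.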
[0077] Embodiments within the scope of the present invention may also include computer-
readable media for carrying or having computer-executable instructions or data structures stored
thereon. Such computer-readable media can be any available media that can be accessed by a
general purpose or special purpose computer. By way of example, and not limitation, such
computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk
storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be
used to carry or store desired program code means in the form of computer-executable instructions
or data structures. When information is transferred or provided over a network or another
communications connection (either hardwired, wireless, or combination thereof) to a computer, the
computer properly views the connection as a computer-readable medium. Thus, any such
connection is properly termed a computer-readable medium. Combinations of the above should
also be included within the scope of the computer-readable media.

[0078] Computer-executable instructions include, for example, instructions and data which cause a 
general purpose computer, special purpose computer, or special purpose processing device to 
perform a certain function or group of functions. Computer-executable instructions also include 
program modules that are executed by computers in stand-alone or network environments. 
Generally, program modules include routines, programs, objects, components, and data structures, 
etc. that perform particular tasks or implement particular abstract data types. Computer-executable
instructions, associated data structures, and program modules represent examples of the program 
code means for executing steps of the methods disclosed herein. The particular sequence of such 
executable instructions or associated data structures represents examples of corresponding acts for 
implementing the functions described in such steps. 

[0079] Those of skill in the art will appreciate that other embodiments of the invention may be
practiced in network computing environments with many types of computer system configurations,
including personal computers, hand-held devices, multi-processor systems, microprocessor-based or
programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the
like. Embodiments may also be practiced in distributed computing environments where tasks are
performed by local and remote processing devices that are linked (either by hardwired links, wireless
links, or by a combination thereof) through a communications network. In a distributed computing
environment, program modules may be located in both local and remote memory storage devices.
[0080] Although the above description may contain specific details, they should not be construed
as limiting the claims in any way. The scheduler source code related to the invention is preferably
written in C code and operates on a separate server but there are no limitations on the software
source code language or any particular hardware configuration. Other configurations of the
described embodiments of the invention are part of the scope of this invention. Accordingly, the
appended claims and their legal equivalents should only define the invention, rather than any
specific examples given.